 critic loss




From Rules to Rewards: Reinforcement Learning for Interest Rate Adjustment in DeFi Lending

arXiv.org Artificial Intelligence

Decentralized Finance (DeFi) lending enables permissionless borrowing via smart contracts. However, it faces challenges in optimizing interest rates, mitigating bad debt, and improving capital efficiency. Rule-based interest-rate models struggle to adapt to dynamic market conditions, leading to inefficiencies. This work applies Offline Reinforcement Learning (RL) to optimize interest rate adjustments in DeFi lending protocols. Using historical data from the Aave protocol, we evaluate three RL approaches: Conservative Q-Learning (CQL), Behavior Cloning (BC), and TD3 with Behavior Cloning (TD3-BC). TD3-BC demonstrates superior performance in balancing utilization, capital stability, and risk, outperforming existing models. It adapts effectively to historical stress events like the May 2021 crash and the March 2023 USDC depeg, showcasing potential for automated, real-time governance.
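
The abstract does not spell out the TD3-BC objective. As a rough illustration only, the sketch below follows the commonly described TD3+BC actor loss: a Q-value term scaled by a normalizer, plus a squared behavior-cloning term. The function name, the scalar actions, and the default `alpha=2.5` are assumptions for this example, not details taken from the paper above.

```python
def td3_bc_actor_loss(q_values, policy_actions, data_actions, alpha=2.5):
    """Sketch of a TD3+BC-style actor loss for a batch with scalar actions.

    q_values:       Q(s, pi(s)) for each state in the batch
    policy_actions: pi(s), the actions proposed by the current policy
    data_actions:   the actions actually logged in the offline dataset
    alpha:          trade-off weight (assumed default, see lead-in)
    """
    n = len(q_values)
    # Normalize the Q term by the average |Q| so alpha is scale-free.
    # Assumes the batch has a nonzero mean absolute Q-value.
    mean_abs_q = sum(abs(q) for q in q_values) / n
    lam = alpha / mean_abs_q
    q_term = sum(q_values) / n
    # Behavior-cloning term: stay close to the logged actions.
    bc_term = sum((p - a) ** 2 for p, a in zip(policy_actions, data_actions)) / n
    return -lam * q_term + bc_term
```

With a policy that reproduces the logged actions exactly, the behavior-cloning term vanishes and only the normalized Q term remains.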


Reviews: Learning to Learn By Self-Critique

Neural Information Processing Systems

Summary: This paper considers few-shot classification and seeks to make use of the unlabeled query data during few-shot classification by training on it with a meta-learned critic loss. The algorithm builds on top of MAML, and has two stages. In the first stage, the model is adapted via gradient descent on the labeled support set. In the second stage, the model is further adapted via a meta-learned critic loss that is a function of a featurization of the model parameters and the unlabeled query set. Originality: The proposed approach strikes me as quite similar to One-Shot Imitation Learning by Domain-Adaptive Meta-Learning (Yu et al. 2018).
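
To make the two-stage structure in the summary concrete, here is a toy sketch with a single scalar parameter. It is not the paper's method: the model `theta * x`, the squared-error support loss, and the `critic_grad` callable (a hand-written stand-in for the gradient of a meta-learned critic loss) are all hypothetical.

```python
def adapt(theta, support, query_inputs, critic_grad, lr=0.1):
    """Two-stage adaptation of a scalar model y = theta * x.

    support:      list of labeled (x, y) pairs
    query_inputs: list of unlabeled x values
    critic_grad:  hypothetical callable returning d(critic loss)/d(theta)
                  from theta and the unlabeled query inputs only
    """
    # Stage 1: one gradient step on squared error over the labeled support set.
    grad = sum(2.0 * (theta * x - y) * x for x, y in support) / len(support)
    theta = theta - lr * grad
    # Stage 2: one further step on the label-free critic loss over the query set.
    theta = theta - lr * critic_grad(theta, query_inputs)
    return theta


def toy_critic_grad(theta, xs):
    # Stand-in critic: gradient of mean((theta * x)^2). A real critic would be learned.
    return sum(2.0 * theta * x * x for x in xs) / len(xs)
```

In the setting the review describes, the second-stage loss would itself be trained in an outer loop; the sketch fixes it by hand to keep the example short.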


Review for NeurIPS paper: Reciprocal Adversarial Learning via Characteristic Functions

Neural Information Processing Systems

Weaknesses: My primary concern is the following. The paper seems to propose two ideas: 1) measuring distance between distributions as an expected squared difference between empirical characteristic functions evaluated at points sampled according to some adversarially learned distribution T; 2) the reciprocal training of adversarial autoencoders, i.e. adversarially aligning embeddings of X and Y, while making sure that these embeddings follow the Gaussian distribution and minimize the reconstruction loss. I wonder whether the impact of these two design choices can be evaluated independently: 1) seeing how direct minimization of C_T(X, g(Z)) with respect to g performs compared to the model with a dedicated encoder/critic; 2) replacing C_T in Algorithm 1 with MMD / Sliced Wasserstein Distance or another statistical distance (moreover, distance to a Gaussian can often be estimated in closed form); does Lemma 4 hold for other statistical distances? There are also some things that I must have misunderstood. In general, the authors discuss in great detail possible interpretations of the phase and amplitude components of CFs, but cram a lot of content critical to a proper understanding of the final model into the first half of page 6. For example, in lines 214-215: "we further re-design the critic loss by finding an anchor as C(f(Y),Z) C(f(X),Z)" - it is still not clear to me what "anchors" the authors are referring to.


Improved Training of Wasserstein GANs

Neural Information Processing Systems

Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only poor samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models with continuous generators. We also achieve high quality generations on CIFAR-10 and LSUN bedrooms.
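
The abstract states only that the norm of the critic's gradient with respect to its input is penalized. The sketch below assumes the commonly cited form of that penalty: a two-sided term (||grad|| - 1)^2 evaluated at random interpolates between real and generated samples, weighted by a coefficient `lam`. The toy critic `tanh(w . x)` is chosen here only because its input gradient can be written by hand; it is not from the paper.

```python
import math
import random


def critic(w, x):
    """Toy critic f(x) = tanh(w . x), used only for illustration."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def critic_input_grad(w, x):
    # Analytic gradient: d/dx tanh(w . x) = (1 - tanh(w . x)^2) * w
    scale = 1.0 - critic(w, x) ** 2
    return [scale * wi for wi in w]


def gradient_penalty(w, real_batch, fake_batch, rng, lam=10.0):
    """Average of lam * (||grad_x f(x_hat)|| - 1)^2 over interpolated points x_hat."""
    total = 0.0
    for x_real, x_fake in zip(real_batch, fake_batch):
        eps = rng.random()  # interpolation weight in [0, 1)
        x_hat = [eps * a + (1.0 - eps) * b for a, b in zip(x_real, x_fake)]
        grad_norm = math.sqrt(sum(g * g for g in critic_input_grad(w, x_hat)))
        total += (grad_norm - 1.0) ** 2
    return lam * total / len(real_batch)
```

In practice the gradient would come from automatic differentiation of a neural critic, and the penalty would be added to the critic's loss in place of weight clipping.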


Learning Progress Driven Multi-Agent Curriculum

arXiv.org Artificial Intelligence

Curriculum reinforcement learning (CRL) aims to speed up learning by gradually increasing the difficulty of a task, usually quantified by the achievable expected return. Inspired by the success of CRL in single-agent settings, a few works have attempted to apply CRL to multi-agent reinforcement learning (MARL) using the number of agents to control task difficulty. However, existing works typically use manually defined curricula such as a linear scheme. In this paper, we first apply state-of-the-art single-agent self-paced CRL to sparse reward MARL. Although its performance is satisfying, we identify two potential flaws of the curriculum generated by existing reward-based CRL methods: (1) tasks with high returns may not provide informative learning signals and (2) the exacerbated credit assignment difficulty in tasks where more agents yield higher returns. Therefore, we further propose self-paced MARL (SPMARL) to prioritize tasks based on learning progress instead of the episode return. Our method not only outperforms baselines in three challenging sparse-reward benchmarks but also converges faster than self-paced CRL.


Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees

arXiv.org Artificial Intelligence

Actor-critic (AC) methods are widely used in reinforcement learning (RL) and benefit from the flexibility of using any policy gradient method as the actor and any value-based method as the critic. The critic is usually trained by minimizing the TD error, an objective that is potentially decorrelated with the true goal of achieving a high reward with the actor. We address this mismatch by designing a joint objective for training the actor and critic in a decision-aware fashion. We use the proposed objective to design a generic AC algorithm that can easily handle any function approximation. We explicitly characterize the conditions under which the resulting algorithm guarantees monotonic policy improvement, regardless of the choice of the policy and critic parameterization. Instantiating the generic algorithm results in an actor that involves maximizing a sequence of surrogate functions (similar to TRPO, PPO) and a critic that involves minimizing a closely connected objective. Using simple bandit examples, we provably establish the benefit of the proposed critic objective over the standard squared error. Finally, we empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.
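
For reference, the standard critic objective that the abstract contrasts against, the squared TD error, can be sketched as follows. This shows the usual one-step baseline, not the decision-aware objective the paper proposes; the function name and the batch layout are assumptions for this example.

```python
def td_critic_loss(values, next_values, rewards, dones, gamma=0.99):
    """Mean squared one-step TD error over a batch of transitions.

    values:      V(s) for each transition
    next_values: V(s') for each transition
    rewards:     immediate reward r
    dones:       True if s' is terminal (no bootstrapping from V(s'))
    """
    total = 0.0
    for v, v_next, r, done in zip(values, next_values, rewards, dones):
        target = r + gamma * (0.0 if done else v_next)
        total += (target - v) ** 2
    return total / len(values)
```

Nothing in this loss depends on how the actor will use the value estimates, which is the mismatch the abstract points to.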


A Connection between One-Step Regularization and Critic Regularization in Reinforcement Learning

arXiv.org Artificial Intelligence

As with any machine learning problem with limited data, effective offline RL algorithms require careful regularization to avoid overfitting. One-step methods perform regularization by doing just a single step of policy improvement, while critic regularization methods do many steps of policy improvement with a regularized objective. These methods appear distinct. One-step methods, such as advantage-weighted regression and conditional behavioral cloning, truncate policy iteration after just one step. This "early stopping" makes one-step RL simple and stable, but can limit its asymptotic performance. Critic regularization typically requires more compute but has appealing lower-bound guarantees. In this paper, we draw a close connection between these methods: applying a multi-step critic regularization method with a regularization coefficient of 1 yields the same policy as one-step RL. While practical implementations violate our assumptions and critic regularization is typically applied with smaller regularization coefficients, our experiments nevertheless show that our analysis makes accurate, testable predictions about practical offline RL methods (CQL and one-step RL) with commonly-used hyperparameters. Our results do not imply that every problem can be solved with a single step of policy improvement, but rather that one-step RL might be competitive with critic regularization on RL problems that demand strong regularization.


Modeling Recommendation Systems as Reinforcement Learning Problem

#artificialintelligence

In this era, a massive volume of information is available to users through the web, which leads to information overload. Recommender systems are used to facilitate the search through this vast space of items by giving users personalised services and items. The vast majority of traditional recommendation systems consider the recommendation procedure as a static process and make recommendations following a fixed strategy. A user interacts with a recommendation engine in a sequence of exchanges of recommendations and provides feedback on them. Hence, we should also try to incorporate the feedback of the user at each time step while recommending items at the next time step.